Model Tuning Project 6: Credit Card Users Churn Prediction

Submission Due Date: Sat, Aug 28, 2021

Project Name: Credit Card Users Churn Prediction

Course Group: Group H - PGP-DSBA-UT 2021

Student Name: James Moralez, P.E.

Project Version: 1, submitted 27 Aug 2021

Objective:

To explore and conduct a statistical analysis of the Thera Bank customer dataset, extract insights using Exploratory Data Analysis (EDA), and build the best possible model with the performance necessary to help the bank improve its services so that customers do not renounce their credit cards

Provide the statistical analysis to complete the following tasks:

  1. Explore and visualize the dataset to identify customers who will leave their credit card services
  2. Build a classification model to predict if the customer is going to churn or not
  3. Optimize the model using appropriate techniques
  4. Generate a set of insights and recommendations that will help the bank improve and retain credit card customers

Data Dictionary:

BankChurners.csv - The CSV file contains the customer base for Thera Bank, including customer profiles and personal information regarding customer accounts, and contains the following fields:

Notes:

Importing and Loading Library Packages

Load and Overview of the Dataset

Loading the CSV Dataset, Copying to a Second Dataset, and Displaying the Dataset Head
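
The loading step might look like the following minimal sketch; a small inline sample stands in for BankChurners.csv so the snippet is self-contained (the column names shown are only a subset of the real file's fields):

```python
import io
import pandas as pd

# Small inline sample standing in for BankChurners.csv (hypothetical rows)
sample_csv = io.StringIO(
    "CLIENTNUM,Customer_Age,Attrition_Flag\n"
    "768805383,45,Existing Customer\n"
    "818770008,49,Attrited Customer\n"
)

df = pd.read_csv(sample_csv)   # in the project: pd.read_csv("BankChurners.csv")
df_copy = df.copy()            # working copy so the original stays untouched
print(df.head())
```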

Displaying detailed info of each column attribute within the dataset

Converting Object Data Type Columns into Category Data Types
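
A minimal sketch of the conversion, using a tiny hypothetical frame in place of the full dataset:

```python
import pandas as pd

df = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer"],
                   "Gender": ["M", "F"]})

# Convert every object-dtype column to the memory-efficient category dtype
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype("category")
print(df.dtypes)
```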

Searching for duplicate records within the dataset

No duplicate records found within the dataset

Brief evaluation of the CLIENTNUM attribute to determine if it is needed

Typically, client IDs are unique to each customer, therefore provide no analytical value and can be immediately removed from the dataset

Counting the number of duplicate values within the CLIENTNUM column

No duplicate entries within the CLIENTNUM column

Searching for the number of unique entries within the CLIENTNUM column

All 10127 CLIENTNUM column entries are unique
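
The duplicate and uniqueness checks above can be sketched as follows, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"CLIENTNUM": [101, 102, 103], "Customer_Age": [45, 49, 51]})

dup_rows = df.duplicated().sum()              # full-row duplicates
dup_ids = df["CLIENTNUM"].duplicated().sum()  # duplicate client IDs
unique_ids = df["CLIENTNUM"].nunique()        # count of unique client IDs
print(dup_rows, dup_ids, unique_ids)
```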

Initial Observations and Insights of Dataset

CLIENTNUM attribute

Removal of CLIENTNUM attribute from the dataset
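
The removal is a one-line drop; a minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"CLIENTNUM": [101, 102], "Customer_Age": [45, 49]})

# Unique client IDs carry no predictive signal, so drop the column
df = df.drop(columns=["CLIENTNUM"])
print(df.columns.tolist())
```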

Checking for Missing Data within Dataset

Missing Data Results
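
The missing-data check can be sketched with `isnull()`, shown here on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income_Category": ["$60K - $80K", np.nan],
                   "Customer_Age": [45, 49]})

missing = df.isnull().sum()   # NaN count per column
print(missing)
```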

Generate Descriptive Statistics

Use the describe() function to generate descriptive statistics that summarize the central tendency of the numerical attributes
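
A minimal sketch of the summary step, transposed so each numeric attribute becomes a row (hypothetical ages):

```python
import pandas as pd

df = pd.DataFrame({"Customer_Age": [45, 49, 51, 40]})

# Transposed describe(): one row per numeric attribute
summary = df.describe().T
print(summary[["mean", "50%", "min", "max"]])
```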

Observations and Insights

A breakdown of the attributes is as follows:

Descriptive summary of category type attributes within the dataset

Observations and Insights

All attributes appear to have reasonable values and range for this portion of the analysis

A breakdown of the attributes is as follows:

Unique Values within the Dataset Attributes

Checking the overall number of unique values in each attribute

Next we will create an attribute list for each data type category

Creating an int64 data type attribute list

Next we will display the unique values for each of the int64 data type attributes

Observations and Insights - Integer Entry Counts

Overall, the attribute entry counts displayed reasonable values for this portion of the analysis

A breakdown of the attributes is as follows:

Creating a float64 data type attribute list

Display the unique values for each of the float64 data type attributes

Observations and Insights - Float Entry Counts

A breakdown of the attributes is as follows:

Creating a category data type attribute list

Display the unique values for each of the category data type attributes

Observations and Insights - Category Entry Counts

A breakdown of the attributes is as follows:

Data Preprocessing for EDA (Exploratory Data Analysis)

In order to conduct our EDA with this dataset, it will first be necessary to prepare the data for analysis

Data Entry Cleaning

In order to prevent data leakage in later analysis, we will want to remain aware of any changes to the dataset that could possibly contribute to it. At the same time, we will need to make adjustments to the data that will allow us to properly use it within this EDA portion of the analysis.

In an effort to maintain the integrity of the dataset, observations with missing (NaN) or "invalid" (abc) entries will be changed to reflect the entry as being "Unknown" for this portion of the EDA.

Replacing entries containing "abc" with "Unknown" in the "Income_Category" attribute
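
A minimal sketch of the replacement, on hypothetical entries:

```python
import pandas as pd

df = pd.DataFrame({"Income_Category": ["abc", "$60K - $80K", "abc"]})

# Treat the placeholder "abc" as an explicit "Unknown" level
df["Income_Category"] = df["Income_Category"].replace("abc", "Unknown")
print(df["Income_Category"].value_counts())
```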

Creating a new EDA dataset

As part of the effort to prevent data leakage in later analysis, missing values will not be treated until the dataset has been properly split. Since we also do not want to drop any observations, either in preparation for the EDA or in later analysis, the missing data will be accounted for by copying the dataset and, in the new EDA copy, changing the missing "NaN" values to "Unknown"

Copying the dataset into a new EDA dataset

Missing Value Treatments for EDA

Imputing Missing NaN Values with string "Unknown" for EDA

Due to the significant number of observations with missing values, rather than delete records or impute values, we will change the missing NaN values to the string value "Unknown" to give a more accurate portrayal of the values within the dataset
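
The fill step can be sketched as follows, on a hypothetical EDA copy:

```python
import numpy as np
import pandas as pd

eda_df = pd.DataFrame({"Education_Level": ["Graduate", np.nan, "High School"]})

# For the EDA copy only: show missing entries as their own "Unknown" level
eda_df["Education_Level"] = eda_df["Education_Level"].fillna("Unknown")
print(eda_df["Education_Level"].tolist())
```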

Verifying that all missing values have been treated

All missing values have been imputed

Display random sample of 10 rows

Redisplaying detailed information to confirm each column attribute

Observations and Insights

EDA (Exploratory Data Analysis)

Univariate Analysis of the Attributes

Because some numerical attributes are categorical in nature rather than true integers, these attributes will need to be separated out and plotted separately using barplots for a better representation of the data

User defined functions for a Histogram, Bar & Box Plot of numerical variables and categories
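
One of these user-defined functions might look like the minimal sketch below: a boxplot stacked above a histogram sharing the x-axis (the function name and layout follow the notebook's description; the data are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

def histogram_boxplot(data, feature):
    """Boxplot on top, histogram below, sharing the x-axis (a minimal sketch)."""
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}
    )
    ax_box.boxplot(data[feature].dropna(), vert=False)
    ax_hist.hist(data[feature].dropna(), bins=20)
    ax_hist.set_xlabel(feature)
    return fig

df = pd.DataFrame({"Credit_Limit": [1438, 3418, 12691, 34516, 4010]})
fig = histogram_boxplot(df, "Credit_Limit")
```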

User defined function that creates barplots that indicate percentage for each attribute category

User defined function that creates stacked barplots

Separating integer data type attributes with fewer than 10 unique values into a list

Histogram & Box Plot of Non-Categorical Numerical and Float Variables

Grouping numerical attributes with more than 10 unique entries (continuous type attributes) into a list

Boxplot and Histogram loop to call def histogram_boxplot for Continuous Type Attributes

Observations and Insights - Continuous Type Attributes - Univariate Analysis

Grouping numerical attributes with float value data types into a list

Boxplot and Histogram loop to call def histogram_boxplot for attributes with float value data types

Observations and Insights - Float Value Attributes - Univariate Analysis

Bar Charts for Numerical Category Attributes

Grouping numerical category attributes with fewer than 10 category type values

Barplot with percentage indicators for each category

Observations and Insights - Numerical Category Attributes - Univariate Analysis

Outlier Detection

First taking a look at each attribute's min, max, range, quantile values, and outlier values for all numerical variables
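
The standard 1.5×IQR rule behind these outlier values can be sketched as follows, on a hypothetical attribute:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # hypothetical attribute values

# Tukey's rule: anything beyond 1.5 * IQR from the quartiles is an outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())
```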

Based upon the results above, non-category attributes, regardless of data type, should be further examined using both the numerical results and the box plots

Next, we will group all non-categorical attributes into a list, then apply the boxplot function below, which highlights outliers in the numerical columns

Grouping non-categorical attributes to evaluate outliers

Use of Boxplot to Check Outliers

Checking the number of customers with high credit limits

Checking the number of customers with high Avg Open To Buy

Checking the number of customers with high Total Amount Change Q4_Q1 Ratios

Observations and Insights - Outlier Analysis

Outlier Treatment:

Bivariate Analysis

In order to conduct a more accurate bivariate analysis, the y-target variable will be converted from a string category data type into an integer data type, so that it can be used in correlation or covariance calculations.

The analysis will then be focused on attributes in relation to the Attrition_Flag

Changing y-target data type string values and replacing with integer binary values
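
The conversion can be sketched with a simple mapping (a minimal example on hypothetical labels; 1 marks an attrited customer):

```python
import pandas as pd

y = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

# 1 = churned (attrited), 0 = retained
y_num = y.map({"Attrited Customer": 1, "Existing Customer": 0})
print(y_num.tolist())
```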

Use of Pairplot Function using "Attrition_Flag" hue and Regression Bar for Checking Covariances

Observations and Insights - Correlation Heatmap

The presence of these strongly correlated pairs may indicate that at least one attribute from each pair can be dropped from the models

Use of the Covariance Table

Observations and Insights - Covariance Table

Use of General Correlation Heatmap Plot

Calculated correlation values with Respect to 'Attrition_Flag'

Observations and Insights - Correlation Heatmap in Relation to "Attrition_Flag"

Bivariate Analysis Plots

In addition to examining the covariance and correlation calculations and plots, we will examine, mostly using boxplots, other dataset attributes directly against the Attrition_Flag attribute

Observations on "Attrition_Flag" vs "Customer_Age"

Observations on "Attrition_Flag" vs "Months_on_book"

Observations on "Attrition_Flag" vs "Months_Inactive_12_mon"

Observations on "Attrition_Flag" vs "Contacts_Count_12_mon"

Observations on "Attrition_Flag" vs "Credit_Limit"

Observations on "Attrition_Flag" vs "Total_Revolving_Bal", with "Card_Category" as a hue

Observations on a BarPlot of "Attrition_Flag" vs "Total_Revolving_Bal"

The above bar plot displays that customers with total revolving balances in the range of roughly 700 to above 1200 are likely to leave Thera Bank's credit card services

Observations on "Attrition_Flag" vs "Total_Trans_Amt"

Observations on "Attrition_Flag" vs "Total_Trans_Ct", with a Card Category Hue

Observation of "Attrition_Flag" and "Total_Ct_Chng_Q4_Q1"

Observation of "Attrition_Flag" and "Avg_Utilization_Ratio"

Data Preparation for Modeling

We will first make a copy of the dataset to create a modeling dataset to work with, so that during modeling we can come back and make changes to our modeling dataset without altering the original dataset

Feature Engineering

Dropping redundant attributes and attributes that will not apply to our predictive models

Since the objective of the modeling is to help the bank improve its services so that existing customers do not renounce their credit cards, we will predict customers who will not renounce their credit cards. As a result, we will drop the columns that will no longer be applicable at the time of prediction for new data.

The attributes "Months_on_Book" and "Ave_Open_To_Buy" will be dropped based upon the strong positive correlations with "Customer_Age" and "Credit_Limit" attributes respectively

The attributes "Months_Inactive_12_mon", "Contacts_Count_12_mon","Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt", "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Avg_Utilization_Ratio" are calculated attributes of past data

Changing "Unknown" string value into missing data value

Split data

First we will separate the dataset into the dependent target variable and the independent variables

Examining the percentage split between attrited and existing customers

Changing y-target data type string values and replacing with integer binary values

Splitting data into training, validation and test sets

We will apply an overall split of the data into a ratio of 60:20:20
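A 60:20:20 split can be achieved with two successive calls to `train_test_split`; a minimal sketch on synthetic data (the `random_state` values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # hypothetical features
y = np.array([0, 1] * 50)           # hypothetical binary target

# First hold out 20% for test, then carve 25% of the remainder for
# validation (0.25 * 0.80 = 0.20), giving a 60:20:20 split overall
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)
print(len(X_train), len(X_val), len(X_test))
```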

Missing-Value Treatment

We will impute missing values in the Education_Level, Marital_Status, and Income_Category attributes using the SimpleImputer function with the 'most_frequent' strategy option to determine the value for each column.
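
A minimal sketch of the imputation on one hypothetical column; the imputer should be fitted on the training split only and then reused on the validation and test splits:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Marital_Status": ["Married", "Single", "Married", np.nan]})

# Fit on the training split only, then reuse the same imputer on val/test
imputer = SimpleImputer(strategy="most_frequent")
train["Marital_Status"] = imputer.fit_transform(train[["Marital_Status"]]).ravel()
print(train["Marital_Status"].tolist())
```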

Building the Model

Model Evaluation Criterion:

Based upon the objective to build a classification model to predict if the customer is going to churn or not, the structure of the model predictions will be as follows:

  1. True Positive - Predicting that a customer will renounce their credit cards and the customer does - Loss of bank income
  2. True Negative - Predicting that a customer will not renounce their credit cards and the customer remains
  3. False Positive - Predicting that a customer is likely to renounce their credit cards and the customer stays
  4. False Negative - Predicting that a customer will not renounce their credit cards and the customer renounces - Loss of bank income
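
Since false negatives (missed churners) are the costly error, recall is the metric to maximize; a minimal sketch with hypothetical labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = attrited (churn), 0 = existing
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0]   # one churner missed -> one false negative

# Recall = TP / (TP + FN): the share of actual churners the model catches
print(recall_score(y_true, y_pred))
```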

Which case is more important?

How to reduce this loss: i.e need to reduce False Negatives?

Model performance minimums

Metrics

Plotting boxplots for CV scores of all models defined above

Defined function to compute and check performance

Defined function to create Confusion Matrix

Logistic Regression Model

Observations

Random Forest Classifier Model

Observations

Bagging Classifier Model

Observations

Boosting AdaBoost Classifier Model

Observations

Gradient Boosting Classifier Model

Observations

XGBoost Classifier Model

Observations

Model Observations

Next we will examine the training set results when oversampling using SMOTE is applied to the training dataset

Model Building - Oversampling Data using SMOTE

Logistic Regression on oversampled data

Random Forest Classifier on Oversampled Data

Bagging Classifier on oversampled data

AdaBoost Classifier on Oversampled Data

Observations

Gradient Boosting Classifier on Oversampled Data

Observations

XGBoost Classifier on Oversampled Data

Observations

Combining upsampling performance results

OverSampling Observations

Next we will examine the training set results when undersampling using RandomUnderSampler is applied to the training dataset

Undersampling Data using RandomUnderSampler

Logistic Regression on Undersampled Data

Observations

Random Forest Classifier on Undersampled Data

Observations

Bagging Classifier on Undersampled Data

Observations

AdaBoost Classifier on Undersampled Data

Observations

Gradient Boosting Classifier on Undersampled Data

Observations

XGBoost Classifier on Undersampled Data

Combining undersampling performance results

UnderSampling Observations

Tuning 3 Models

Based upon the results from the first set of six models above, although recall is already above 0.95, we will attempt to tune the three models with the highest validation recall scores: Random Forest, GBC, and XGB.

Though the validation scores are already high, they were as follows:

Random Forest Classifier Tuning
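
A minimal sketch of what such a tuning step might look like, using a small illustrative grid and synthetic data (the project's actual grid, data, and cross-validation settings may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the prepared training split
X, y = make_classification(n_samples=200, weights=[0.84], random_state=1)

# Small, illustrative hyperparameter grid
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="recall",   # optimize for catching churners
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```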

Random Forest Classifier Tuning Observations

Tuned Random Forest Classifier Feature Importances

Random Forest Classifier Feature Importances Observations

Gradient Boosting Classifier Tuning

Observations - Gradient Boosting Tuning

AdaBoost Classifier Tuning

Observations - AdaBoost Tuning

Observations

AdaBoost RandomizedSearchCV

Observations

XGBClassifier using RandomizedSearchCV

Observation

Comparing Random Search Models

Observations

Pipeline for Productionizing the Model

We will build a simple pipeline using the MinMaxScaler function and the SVC algorithm
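
Such a pipeline might be sketched as follows, on synthetic data in place of the prepared training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scaling and the classifier travel together, so the transform fitted on
# the training data is reapplied automatically at prediction time
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)
print(round(pipe.score(X_test, y_test), 3))
```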

The test score is based upon the model and not the pipeline

Observation

Business Recommendations

Based upon the EDA insights, a suggested profile of customers leaving Thera Bank's credit card services would be:

Conclusion